Discovering mutation paths in sets of genetic sequences

CERTH / ITI – SequenceVisualAnalysis

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations:

       Georgios Petkos, CERTH / ITI, gpetkos@iti.gr  [PRIMARY contact]
       Konstantinos Moustakas, CERTH / ITI, moustak@iti.gr
       Dimitrios Tzovaras, CERTH / ITI, Dimitrios.Tzovaras@iti.gr

     

Tool(s):

For the purposes of the challenge, we created a custom application using Processing. It offers two alternative but complementary visualizations of the relationships between the genetic sequences. In the first, sequences are positioned in 2D space according to their genetic similarity and resulting disease characteristics using multidimensional scaling (MDS). In the second, a minimum spanning tree is computed in order to obtain the most likely mutation paths between all pairs of nodes. Among other interaction mechanisms, the user can select sequences in the main visualization and inspect them in an auxiliary visualization. For more details, please see the two-page summary.

The tool was built entirely by our team at CERTH / ITI; the only external dependency is the MDSJ library, which performs the multidimensional scaling analysis.
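As a rough illustration of the tree construction described above, the following Python sketch builds a minimum spanning tree over pairwise distances using Prim's algorithm. It is not the code of the Processing application. The use of the Hamming distance as the dissimilarity measure, the function names and the toy sequences are all assumptions made for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(seqs):
    """Prim's algorithm on the complete graph of Hamming distances.

    seqs maps a name to its sequence; returns a list of (u, v, distance) edges.
    """
    names = sorted(seqs)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge connecting a node inside the tree to one outside it.
        d, u, v = min(
            (hamming(seqs[u], seqs[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, d))
        in_tree.add(v)
    return edges


# Hypothetical toy data, not taken from the challenge data set.
toy = {
    "native": "AAAAAAAA",
    "s1": "AAAACAAA",
    "s2": "AAAACAGA",
    "s3": "TAAAAAAA",
}
tree = minimum_spanning_tree(toy)
for u, v, d in tree:
    print(u, "-", v, "differ at", d, "position(s)")
```

Each edge of the resulting tree links two sequences that differ in few positions, which is why a path through the tree can be read as a plausible chain of mutation steps.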

 

Video:

 

Description of analysis.

 

 

ANSWERS:


MC3.1: What is the region or country of origin for the current outbreak?  Please provide your answer as the name of the native viral strain along with a brief explanation.

The region of origin is Nigeria_B. The following figure shows the (minimum spanning) tree representation. The black dots represent the native sequences, whereas the red dots represent the outbreak sequences. All outbreak sequences belong to a single separate subtree, and the native node adjacent to that subtree is the Nigeria_B strain.

Similarly, in the multidimensional scaling representation (next figure), the outbreak sequences form a tight cluster whose closest native sequence is Nigeria_B. Since all outbreak sequences are much closer to Nigeria_B than to the other native sequences, the outbreak most likely originated in Nigeria_B.
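The same reasoning can be checked numerically, without any visualization, by comparing each native strain with the whole outbreak set. The sketch below is only an illustration: the mean Hamming distance as the closeness measure, the strain names and the toy sequences are assumptions, not the method or data of the application.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_native(natives, outbreak):
    """Return (name, mean distance) of the native strain closest to the outbreak set."""
    def mean_distance(seq):
        return sum(hamming(seq, o) for o in outbreak) / len(outbreak)

    name = min(natives, key=lambda n: mean_distance(natives[n]))
    return name, mean_distance(natives[name])


# Hypothetical toy data; the strain names are placeholders.
natives = {"strain_X": "AAAAAAAA", "strain_Y": "CCCCAAAA"}
outbreak = ["AAAAAAAT", "AAAAAAGT", "AAAACAAT"]

best, dist = closest_native(natives, outbreak)
print(best, round(dist, 2))
```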


MC3.2:  Over time, the virus spreads and the diversity of the virus increases as it mutates.  Two patients infected with the Drafa virus are in the same hospital as Nicolai.  Nicolai has a strain identified by sequence 583.  One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51.  Assume only a single viral strain is in each patient.  Which patient likely contracted the illness from Nicolai and why?  Please provide your answer as the sequence number along with a brief explanation.

The answer is 123. We use the tree representation and mark the sequences 583, 123 and 51. As can be seen in the following figure, sequence 123 is only a single mutation step (link) away from 583, whereas 3 links separate 583 from 51. Therefore, the patient with strain 123 is more likely to have contracted the illness from Nicolai. Additionally, the marked sequences are displayed below the main visualization (only the positions where at least one sequence differs from the rest are shown), and it can be seen that sequence 123 matches sequence 583 better than sequence 51 does.

A similar result is obtained with the MDS based representation, as displayed in the following figure.
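The auxiliary display of the marked sequences, restricted to the positions where they disagree, can be sketched as follows. This is a minimal illustration rather than the application's code; the labels and sequences are hypothetical and do not reproduce sequences 583, 123 or 51.

```python
def variable_positions(seqs):
    """0-based positions at which the given sequences do not all agree."""
    return [
        i for i, column in enumerate(zip(*seqs.values())) if len(set(column)) > 1
    ]


def differences(a, b):
    """0-based positions at which two sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical toy sequences with placeholder labels.
marked = {
    "reference": "ACGTACGTAC",
    "candidate_a": "ACGTACGTAA",
    "candidate_b": "ACCTAGGTTC",
}

cols = variable_positions(marked)
for name, seq in marked.items():
    # Show only the columns in which at least one sequence differs.
    print(name.ljust(12), "".join(seq[i] for i in cols))

diff_a = differences(marked["reference"], marked["candidate_a"])
diff_b = differences(marked["reference"], marked["candidate_b"])
```

In this toy case candidate_a differs from the reference at one position and candidate_b at three, so candidate_a would be the more likely direct descendant.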


MC3.3:  Signs and symptoms of the Drafa virus are varied and humans react differently to infection.  Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them. 

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic).  The mutations involve one or more base substitutions.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39)

Answer:

A → C, 268

G → C, 211

A → G, 222

(Please note that positions are counted starting from 0.)

Using the tree-based visualization, we choose to display only symptom severity from the disease characteristics. We then look for the subtree with the most intense red colors. One is located on the left, below the mutation from A to C at position 268. Interestingly, when this mutation is highlighted, a second branch with just a single child is also highlighted, meaning that the same mutation occurs at another point in the tree (see next figure). The occurrence of another sequence with severe symptoms and the same mutation supports the assumption that the mutation from A to C at position 268 significantly increases symptom severity.
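Highlighting a mutation amounts to finding every tree edge along which that substitution occurs. A minimal sketch is given below, assuming the tree has been rooted so that each edge has a parent and a child; the node names, sequences and function names are hypothetical. Positions are 0-based, as in the answer above.

```python
def edge_mutations(parent_seq, child_seq):
    """Substitutions along one edge as (from_base, to_base, position) tuples."""
    return [
        (p, c, i)
        for i, (p, c) in enumerate(zip(parent_seq, child_seq))
        if p != c
    ]


def edges_with_mutation(edges, seqs, mutation):
    """All (parent, child) edges on which the given substitution occurs."""
    return [
        (u, v) for u, v in edges if mutation in edge_mutations(seqs[u], seqs[v])
    ]


# Hypothetical toy tree, listed as (parent, child) pairs.
seqs = {"root": "AAGA", "n1": "ACGA", "n2": "ACGT", "n3": "AATA", "n4": "ACTA"}
edges = [("root", "n1"), ("n1", "n2"), ("root", "n3"), ("n3", "n4")]

hits = edges_with_mutation(edges, seqs, ("A", "C", 1))
print(hits)
```

Here the substitution A → C at position 1 appears on two separate branches, which mirrors the situation described above where the same mutation is highlighted at two points of the tree.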

Similarly, it is easy to spot that another mutation that increases symptom severity is the substitution of G by C at position 211, as shown in the following figure.

 

Similarly, the last mutation, from A to G at position 222, is easy to spot at the red arrow in the figure above. This subtree is chosen over the one marked with the blue arrow because it has a larger proportion of sequences with severe symptoms.
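The comparison between two candidate subtrees can be made explicit by computing the fraction of severe sequences below each mutation. The sketch below is only illustrative: the tree, the numeric severity scale and the threshold are assumptions, not values from the challenge data.

```python
def subtree(children, root):
    """All nodes in the subtree rooted at `root`, including the root itself."""
    stack, nodes = [root], []
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(children.get(node, []))
    return nodes


def severe_ratio(children, severity, root, threshold):
    """Fraction of nodes in the subtree whose severity reaches the threshold."""
    nodes = subtree(children, root)
    return sum(severity[n] >= threshold for n in nodes) / len(nodes)


# Hypothetical toy trees; severity is on an assumed scale from 1 (mild) to 3 (severe).
children = {"a": ["a1", "a2"], "a1": ["a3"], "b": ["b1", "b2", "b3"]}
severity = {"a": 3, "a1": 3, "a2": 2, "a3": 3, "b": 3, "b1": 1, "b2": 3, "b3": 1}

ratio_a = severe_ratio(children, severity, "a", threshold=3)
ratio_b = severe_ratio(children, severity, "b", threshold=3)
print(ratio_a, ratio_b)
```

With these toy values the subtree under "a" would be preferred, since three of its four sequences are severe compared with two of four under "b".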


MC3.4:  Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question.  To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important.  Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions.  In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).

Answer:

A → T, 945

A → G, 222

A → G, 820

A process similar to that of the previous question is followed. Only the aggregate characteristic is displayed, and again we look for the subtree with the most intense coloring. We can use both the tree representation and the MDS representation together with the tree connectivity. The following figure displays the visualization using MDS. As can be seen, the first mutation that significantly increases the disease characteristics, from A to T at position 945, is easy to spot.

The other significant mutations are harder to see in the MDS representation, so we switch to the regular tree representation. As can be seen in the following figure, one is the change from A to G at position 222, as in the previous question, and the other is the change from A to G at position 820 (marked by the red arrow).
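Since the question asks for all characteristics to be treated as equally important, one plausible way to form the aggregate characteristic is an unweighted mean of the individual characteristics after scaling each to the range 0 to 1. The sketch below assumes this definition; the field names, the maximum levels and the toy records are hypothetical and may differ from what the application computes.

```python
CHARACTERISTICS = (
    "symptoms",
    "mortality",
    "complications",
    "drug_resistance",
    "at_risk_groups",
)


def aggregate(record, max_level):
    """Equal-weight mean of the characteristics, each scaled to [0, 1]."""
    return sum(record[c] / max_level[c] for c in CHARACTERISTICS) / len(
        CHARACTERISTICS
    )


# Hypothetical scale: every characteristic is assumed to range from 0 to 2.
max_level = {c: 2 for c in CHARACTERISTICS}

# Hypothetical toy records, not taken from the challenge data set.
records = {
    "s1": dict.fromkeys(CHARACTERISTICS, 2),
    "s2": {**dict.fromkeys(CHARACTERISTICS, 0), "symptoms": 2},
}

scores = {name: aggregate(rec, max_level) for name, rec in records.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

A strain that is severe in only one characteristic scores low under this measure, which is why the answers to this question differ in part from those of the previous one.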